Abstract

Community detection in graphs has been extensively studied both in theory and in applications.
However, detecting communities in hypergraphs is more challenging. In this paper, we propose a tensor
decomposition approach for guaranteed learning of communities in a special class of hypergraphs
modeling social tagging systems or folksonomies. A folksonomy is a tripartite 3-uniform hypergraph
consisting of (user, tag, resource) hyperedges. We posit a probabilistic mixed membership community
model, and prove that the tensor method consistently learns the communities under efficient sample
complexity and separation requirements.

Keywords: Community models, social tagging systems/folksonomies, mixed membership models, tensor
decomposition methods.


1 Introduction 


Folksonomies or social tagging systems (Chakraborty et al., 2012) have been hugely popular in recent years.
These are tripartite networks consisting of users, resources and tags. The resources vary according to
the system. For instance, in Delicious, the URLs are the resources, in Flickr, they are the images, in LastFm,
they are the music files, in MovieLens, they are the reviews, and so on. The collaborative annotation of these
resources by users with descriptive keywords enables faster search and retrieval (Chakraborty and Ghosh,
2013).


The role of community detection in folksonomies cannot be overstated. Online social tagging systems
are growing rapidly and it is important to group the nodes (i.e. users, resources and tags) for scalable
operations in a number of applications such as personalized search (Xu et al., 2008), resource and friend
recommendations (Konstas et al., 2009), and so on. Moreover, learning communities can provide an
understanding of the community formation behavior of humans, and the role of communities in human interaction
and collaboration in online systems.

Folksonomies are special instances of hypergraphs. A folksonomy is a tripartite 3-uniform hypergraph
consisting of hyperedges between users, resources and tags. Scalable community detection in hypergraphs
is in general challenging, and most previous works are limited to pure membership models, where a node
belongs to at most one group. This is highly unrealistic since users have multiple interests, and the tags
and resources have multiple contexts or topics. A few works which do consider overlapping communities
in folksonomies are heuristic without any guarantees and do not incorporate any statistical modeling (see
the Related Work section for details).

In this paper, we propose a novel probabilistic approach for modeling folksonomies, together with a
guaranteed method for detecting overlapping communities in them. A naive model for folksonomies would
result in a large number of model parameters, and make learning intractable. Here we present a more
scalable approach where realistic conditional independence constraints are imposed, leading to scalable
modeling and tractable learning.

Our model is a hypergraph extension of the popular mixed membership stochastic blockmodel (MMSB),
introduced by Airoldi et al. (2008). We impose additional conditional independence
constraints, which are natural for social tagging systems. We term our model the mixed membership stochastic
folksonomy (MMSF). When hypergraphs are generated from such a class of MMSFs, we show that the
hyper-edges can be much more informative about the underlying communities than in the graph setting.
Intuitively, this is because the hyper-edges represent multiple views of the hidden communities. In this
paper, we show that these properties can be exploited for learning via spectral approaches.


1.1 Summary of Results 

We develop a practically relevant mixed membership hypergraph model and propose novel methods to
learn it with guarantees. We posit a probabilistic model for generation of hyper-edges $\{r, u, t\}$ between
resources $r$, users $u$ and tags $t$. We impose the natural conditional independence assumption that, conditioned on
the community memberships of individual nodes, the hyperedge generations are independent. In addition,
we assume that the users select tags for a given resource based on the context in which the resource is
accessed. For instance, consider the resource as a paper that falls both in theoretical and applied machine
learning, as shown in Figure 1. If a user accesses the resource under the context of theory, he/she uses
tags that are indicative of theory. Note that we allow the users and tags to be in multiple communities;
however, the actual realization of a hyper-edge depends only on the context in which the resource was
accessed. Depending on what kind of user is tagging the paper, the likelihood of choosing various tags such
as application, latent variable model, etc. changes. The conditional independence assumption states that once
a user accesses the paper in a certain context (e.g. looking for applications), the probability of using tags in a
category (e.g. applications, experiments) depends only on that context. There are many other such examples.
For example, a movie can be a drama about a political figure. A person who is mostly into politics will watch
this movie in the context of politics and use political tags (for example the name of the person, or specific political
events that were illustrated in the movie), while a person who is more into the drama genre will use drama to
tag the movie.

While learning community models on general hypergraphs is NP-hard, our setting is geared towards
folksonomies with users, resources and tags, and the assumptions we make naturally hold in this
setting. Importantly, we allow for general distributions for mixed community memberships. The earlier work
by Anandkumar et al. (2014a) on MMSB models on graphs is limited to the Dirichlet distribution. Note
that the Dirichlet assumption for community memberships can be limiting and cannot model general
correlations in memberships. Without the Dirichlet assumption, the earlier techniques, when applied directly,
would yield tensors in the Tucker form, which do not possess a unique decomposition, and thus the
communities cannot be learnt from the tensor forms. In addition, our moment forms are different since we are in the
hypergraph setting and the conditional independence assumptions are different. Thus, earlier work on MMSB
cannot be directly applied here.

In addition, we impose weak assumptions on the distribution of the community memberships. This is
required since the memberships are in general not identifiable when they are mixed. While the original
MMSB model (Airoldi et al., 2008) assumes that the communities are drawn from a Dirichlet distribution,
here, we do not require such a strong parametric assumption. Here, we impose a weak assumption that a
certain fraction of resource nodes are “pure” and belong to a single community. This is reasonable to expect
in practice. We establish that the communities are identifiable under these natural assumptions, and can be
learnt efficiently using spectral approaches.

Here, we propose a novel algorithm to detect pure nodes belonging to a single community. The presence
of pure nodes is natural to expect in practice and does not require the Dirichlet assumption. Our method
consists of two main routines. First, we design a simple rank test to identify pure resource nodes. The
algorithm first projects the hyperedges onto the subspace of the top-$k$ eigenvectors. It then performs a
rank test on the matricization of the connectivity vector of each resource node, where rows correspond to users
and columns correspond to tags. We can then exploit these detected pure nodes to form tensors that can be
decomposed efficiently to yield the communities for all the nodes (and not just the pure nodes). We prove
that our proposed method correctly recovers the parameters of the MMSF model when exact moments are
input. This two-stage algorithm is expected to have much wider applicability than the MMSB model, which
is limited to the Dirichlet distribution. For this general model, we establish a tight sample complexity bound
on $n$ under which the communities can be recovered.

For the first step, we construct a matrix for each resource node, consisting of its edges to users and
tags. We show that this matrix is rank-1 in expectation (over the hyperedges) for a pure resource node. This
property enables us to identify such pure nodes. We then construct a 3-star count tensor using these estimated
pure resource nodes. We count the pure resource nodes which are common to triplets of (user, tag) tuples
to form the tensor. We show that in expectation this tensor has a CP decomposition form, and computing this
decomposition yields the community memberships after some simple post-processing steps.

We then carefully analyze the perturbation bounds under empirical moments, and show that the
communities can be accurately recovered under some natural assumptions. The perturbation analysis for this
step is novel since it requires analyzing the effect of standard spectral perturbations on matricization and
the subsequent rank test. We use subexponential Hanson-Wright inequalities to obtain tight guarantees for
this step. These assumptions determine how the number of nodes $n$ is related to the number of communities
$k$, and a lower bound on the separation $p - q$, where $p$ denotes the connectivity within the same
community, while $q$ denotes the connectivity across different communities. Such requirements have been imposed
before in the graph setting, for stochastic block models (Yudong et al., 2012) and mixed membership
models (Anandkumar et al., 2014a). Here, we show that for MMSF, the requirement is stronger, since intuitively,
we require concentration on a hypergraph instead of a graph. We employ sub-exponential forms of the
Hanson-Wright inequality to get tight bounds in the sparse regime, where the connectivity probabilities $p$, $q$
are small. Thus, we obtain efficient guarantees for recovering mixed membership communities from social
tagging networks.

We establish that for the success of rank test, if p ~ q, we need the network size to scale as n = 
D (A:^) (when the correlation matrix of community membership distribution is well-conditioned). For the 
case where q < p/k, we require n = Q {k"^)- This is intuitive as the role of q is to make the different 
community components non-orthogonal for the rank test, i.e., q acts as noise. Therefore, a smaller q results 
in better guarantees. For the success of tensor decomposition method, we require n = Q, {k^), when p, q 
are constants, in the well-conditioned setting. Note that in comparison, for learning mixed membership
stochastic block model graphs, we require n = D (k^), from Anandkumar et al. (2014a), which is a lower
sample complexity. This is because we need to learn a larger number of parameters in the hypergraph setting.
Moreover, for sparse graphs, the parameters $p$, $q$ decay with $n$, and we also handle this setting, and provide
the precise bounds in Section 4.


1.2 Related Work 


There is an extensive body of work on community detection in graphs. Popular methods with guarantees
include spectral clustering (McSherry, 2001) and convex optimization (Yudong et al., 2012). For a detailed
survey, see (Anandkumar et al., 2014a). However, these methods cannot handle mixed membership models,
where a node can belong to more than one community.


Our algorithm is based on the tensor decomposition approach of (Anandkumar et al., 2013) for the
pairwise MMSB model in graphs. The method has been implemented for many real-world datasets and has
shown significant improvement in running times and accuracy over the state-of-the-art stochastic variational
techniques (Huang et al., 2013). The tensor consists of third order moments in the form of counts of 3-star
subgraphs, i.e., a star subgraph consisting of three leaves, for each triplet of leaves. The MMSB model
assumes a Dirichlet distribution for community memberships, and in this case, a modified 3-star count tensor
is used. It is shown that this tensor has a CP-decomposition form, and the components of the decomposition
can be used to learn the parameters of the MMSB model. However, this method cannot be extended easily
to general distributions, beyond the Dirichlet assumption, since for general distributions, the 3-star count
tensor only has a Tucker decomposition form, and not a CP form. In general, the model parameters are not
identifiable from a Tucker form. Thus, in graphs, mixed membership models cannot be easily learnt when
general distributions (beyond the Dirichlet distribution) for mixed memberships are assumed. In this paper,
we show that in the hypergraph setting, more general distributions of community memberships can be learnt,
when certain conditional independence relationships are assumed for hyper-edge generation.

Another limitation of the MMSB model is that due to the Dirichlet assumption, only normalized
community memberships can be incorporated. However, in this case, the mixed nodes (i.e. those belonging to more
than one community) are less densely connected than the pure nodes, as pointed out by (Yang and Leskovec,
2013). In contrast, in our paper, we can handle un-normalized community membership vectors (in a
weighted graph), since we do not make the Dirichlet assumption, and thus, this limitation is not present.
However, for simplicity, we present the results in the normalized setting.

Scalable community detection in hypergraphs is in general challenging and most previous works are
limited to pure membership models, where a node belongs to at most one group (Brinkmeier et al., 2007;
Lin et al., 2009; Murata, 2010; Neubauer and Obermayer, 2009; Vazquez, 2009). Clustering in multipartite
hypergraphs can be seen as an extension of the co-clustering of matrices, where rows and columns are
simultaneously clustered. In (Jegelka et al., 2009), extensions of co-clustering to the tensor setting are considered.
However, this setting can only handle pure communities, where a node belongs to at most one community. A
few works which do consider mixed communities in hypergraphs are heuristic without any guarantees, and
do not incorporate any statistical modeling (Wang et al., 2010; Chakraborty et al., 2012; Papadopoulos et al.,
2010). They mostly use modularity based scores without providing any guarantees. In this paper, we present
the first guaranteed method for learning communities in mixed membership hypergraphs.


2 Mixed Membership Model for Folksonomies 

Setup: We consider folksonomies modeled as tripartite 3-uniform hypergraphs over three sets of nodes,
viz., the set of users $U$, the set of tags $T$ and the set of resources $R$. A hyperedge $\{u, t, r\}$ occurs when user $u$ tags
resource $r$ with tag $t$. For convenience, we will consider a matricized version of the $\{0,1\}$ hyper-adjacency
tensor, denoted by $G \in \{0,1\}^{|U| \cdot |T| \times |R|}$, which indicates the presence of hyper-edges. The reason behind
considering matricization along the resource mode will soon become clear. We use the notation $G(\{u,t\},r)$
to denote the entry corresponding to the hyper-edge $\{u,t,r\}$, and $G(\{U,T\},r)$ to denote the column vector
corresponding to the set of hyper-edges $\{U,T,r\}$.

[Figure 1 diagram. Communities: 1) T-ML: Theoretical Machine Learning; 2) A-ML: Applied Machine Learning.
Resource 1: 55% T-ML, 45% A-ML. Tags: Applications (20% T-ML, 80% A-ML), Latent Variable Models
(35% T-ML, 65% A-ML), Theoretical Guarantees (95% T-ML, 5% A-ML).]

Figure 1: Overview of MMSF model for an example of machine learning articles (resources) tagged by
users. One article (resource) and the corresponding tags by two users are shown. Two communities,
Theoretical machine learning and Applied machine learning, are assumed. The mixed community memberships
of resources, users and tags are also shown.

We consider models with $k$ underlying (hidden) communities and let $[k] := \{1, 2, \ldots, k\}$. For node $i$,
let $\pi_i \in \mathbb{R}^k$ denote its community membership vector, i.e., the vector is supported on the communities to
which the node belongs. Define $\Pi_U := [\pi_i : i \in U] \in \mathbb{R}^{k \times |U|}$ to denote the set of column vectors denoting the
community memberships of users in $U$, and similarly define $\Pi_T$ and $\Pi_R$. Let $\Pi := [\pi_i : i \in U \cup T \cup R]$.

We now provide a statistical model to explain the presence of hyper-edges $\{u, t, r\}$ among users, tags
and resources through the community memberships. We consider a mixed membership model, where
there are multiple communities for users, tags and resources. Intuitively, users belonging to certain groups
(i.e. interested in certain topics) will tend to select resources mainly comprised of those topics. The tags
employed by the users are dependent on the contextual category of the resource selected by the user. This
intuition is formalized under our proposed statistical model below.

Let $z_{u \to \{t,r\}} \in \mathbb{R}^k$ be a coordinate basis vector which denotes the community membership of user $u$
when posting tag $t$ and resource $r$, and similarly let $z_{r \to \{u,t\}}$ and $z_{t \to \{u,r\}}$ denote the memberships of resource
$r$ and tag $t$ when participating in the hyperedge $\{u, t, r\}$.

Let $P \in [0,1]^{k \times k}$ be the community connectivity matrix, where $P_{ij}$ denotes the probability that a user in
community $i$ selects a resource in community $j$. Similarly, let $\tilde{P} \in [0,1]^{k \times k}$ denote a matrix such that each
entry $\tilde{P}_{ij}$ denotes the probability that a tag in community $i$ is associated with a resource in community $j$.

The proposed mixed membership stochastic folksonomy (MMSF) is as follows:

• For each node $i \in U \cup T \cup R$, draw its community membership vector $\pi_i \in \mathbb{R}^k$, i.i.d. from some
distribution $f_\pi$.

• For each triplet $\{u, t, r\}$, draw coordinate basis vectors $z_{u \to \{t,r\}} \sim \text{Multinomial}(\pi_u)$, $z_{t \to \{u,r\}} \sim
\text{Multinomial}(\pi_t)$ and $z_{r \to \{u,t\}} \sim \text{Multinomial}(\pi_r)$ in a conditionally independent manner, given $\Pi$.

• Draw random variables

$B_{r \to u;t} \sim \text{Bernoulli}(z_{u \to \{t,r\}}^\top P \, z_{r \to \{u,t\}})$,
$B_{r \to t;u} \sim \text{Bernoulli}(z_{t \to \{u,r\}}^\top \tilde{P} \, z_{r \to \{u,t\}})$. (1)

The presence of hyper-edge $G(\{u, t\}, r)$ is given by the product

$G(\{u, t\}, r) = B_{r \to u;t} \cdot B_{r \to t;u}$. (2)


The use of variables $z_{u \to \{t,r\}}$, $z_{t \to \{u,r\}}$ and $z_{r \to \{u,t\}}$ allows for context-dependent selection of group
memberships as in the MMSB model. Given a resource and its context, a user may choose to access the
resource, and the probability of using a tag on a resource depends on the context of the tag and the resource. Given
the context of the user, the tag and the resource, these two events are independent. In order to have a hyper-edge,
we need both events to happen and this explains Eqn. (2).
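As a concrete illustration of the generative steps above, the following minimal Python sketch samples a single hyper-edge indicator following Eqns. (1)-(2). The function names (`sample_context`, `sample_hyperedge`, `homogeneous`), the list-of-lists representation, and the name `P_tilde` for the tag-resource connectivity matrix are assumptions made for illustration only; they are not part of the model specification.

```python
import random


def sample_context(pi, rng):
    """Draw one community index from the membership vector pi (a single multinomial trial)."""
    return rng.choices(range(len(pi)), weights=pi, k=1)[0]


def sample_hyperedge(pi_u, pi_t, pi_r, P, P_tilde, rng):
    """Sample the indicator G({u,t},r) for one (user, tag, resource) triplet."""
    z_u = sample_context(pi_u, rng)  # context of the user
    z_t = sample_context(pi_t, rng)  # context of the tag
    z_r = sample_context(pi_r, rng)  # context of the resource, shared by both events
    # Event 1: the user selects the resource (user-resource connectivity P).
    b_user = rng.random() < P[z_u][z_r]
    # Event 2: the tag is associated with the resource (tag-resource connectivity P_tilde).
    b_tag = rng.random() < P_tilde[z_t][z_r]
    # The hyper-edge is present only if both events occur.
    return int(b_user and b_tag)


def homogeneous(k, p, q):
    """Connectivity matrix with p on the diagonal and q off the diagonal."""
    return [[p if i == j else q for j in range(k)] for i in range(k)]


if __name__ == "__main__":
    rng = random.Random(0)
    P = homogeneous(2, 0.8, 0.1)
    draws = [sample_hyperedge([0.9, 0.1], [0.7, 0.3], [1.0, 0.0], P, P, rng) for _ in range(1000)]
    print(sum(draws) / len(draws))  # empirical hyper-edge frequency
```

Both Bernoulli draws condition on the same resource context `z_r`, which is what couples the user event and the tag event inside one hyper-edge.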

Ours is a resource-centric model, where a resource can be regarded as comprising many topics or
communities. Which tags get associated with the resource depends on the context of the resource $z_{r \to \{u,t\}}$
and of the tag $z_{t \to \{u,r\}}$, and similarly, which user selects a resource depends on the context of the user
$z_{u \to \{t,r\}}$ and of the resource $z_{r \to \{u,t\}}$. The hyper-edges are drawn according to (2) and thus, matricization
along the resource mode is convenient for analysis. Our model is resource centric and not user centric. The
intuition is that the tags associated with a resource depend on the context in which the resource is being
accessed, and the likelihood of the user accessing a resource depends on his/her current group and the
context of the resource. Figure 1 provides an instance of a hypergraph where the resource is a paper and the
communities consist of theoretical and applied machine learning.

Unlike the pairwise MMSB model (Airoldi et al., 2008), where the edges are conditionally independent
given the community memberships, in the proposed MMSF model, the edges $B_{r \to t;u}$ and $B_{r \to u;t}$ contained
in the hyperedge $\{u, t, r\}$ are not conditionally independent given the community memberships, since they
are selected based on the common context $z_{r \to \{u,t\}}$ of the resource $r$. Thus, the MMSF model captures
dependencies beyond the pairwise MMSB model. At the same time, the MMSF model has conditionally
independent hyperedges given the community memberships, which leads to tractable learning.

We do not take the approach of modeling hyperedges directly, i.e., through a community connectivity
tensor $P \in \mathbb{R}^{k \times k \times k}$, where $P_{a,b,c}$ would give the probability that a user in community $a$ would have
a hyperedge with resource $b$ and tag $c$. This would lead to $k^3$ unknown parameters, while our model has
only $k^2$ unknown parameters. Moreover, if the user at a certain point is interested in some topic (i.e. draws
$z_{u \to \{t,r\}}$ in some community), then he looks for resources and tags having significant membership in that
topic (modeled through draws of $z_{t \to \{u,r\}}$ and $z_{r \to \{u,t\}}$) and this will generate the hyper-edge $u \to \{t, r\}$.

We assume that the community vectors are drawn i.i.d. from a general unknown distribution: for $i \in [n]$,
$\pi_i \overset{\text{i.i.d.}}{\sim} f_\pi$, supported on the $(k-1)$-dimensional simplex

$\Delta^{k-1} := \{\pi \in \mathbb{R}^k,\ \pi(i) \in [0,1],\ \sum_i \pi(i) = 1\}$.


The performance of our learning algorithms will depend on the distribution of $\pi$. In particular, we assume
that with probability $\rho$, a realization of $\pi$ is a coordinate basis vector, and thus, about a $\rho$ fraction of the nodes
in the network are pure, i.e. they belong mostly to a single community. In this paper, we investigate how the
tractability of learning the communities depends on $\rho$.

Figure 2: Our moment-based learning algorithm uses the 3-star count tensor from set X to sets A, B, C.


3 Proposed Method 

Notation: For a matrix $M$, if $M = UDV^\top$ is the SVD of $M$, let $k$-svd$(M) := U\tilde{D}V^\top$ denote the
$k$-rank SVD of $M$, where $\tilde{D}$ is limited to the top-$k$ singular values of $M$. A matrix $A$ with $q$ columns is stacked as a
vector $a$ by the vec(·) operator,

$a = \text{vec}(A) \iff a((i_1 - 1)q + i_2) = A(i_1, i_2)$.

The reverse matricization operation is denoted by mat(·), i.e. above $A = \text{mat}(a)$. Let $A * B$ denote
the Hadamard or entry-wise product. Let $k$-svd$(M)$ of a matrix $M$ denote its restriction to the top-$k$ singular
values, i.e. if $M = U \Lambda V^\top$, then $k$-svd$(M) = U_k \Lambda_k V_k^\top$, which denotes the restriction of the subspaces and the
singular values to the top-$k$ ones.
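The vec(·) and mat(·) operators can be sketched in a few lines of Python. This is only an illustration of the index convention above, written with 0-based indices, so that the paper's rule becomes `a[i1 * q + i2] == A[i1][i2]`; the function names `vec` and `mat` mirror the notation, and the list-of-lists matrix representation is an assumption.

```python
def vec(A):
    """Stack the rows of A into one flat list (row-major order)."""
    return [x for row in A for x in row]


def mat(a, q):
    """Inverse of vec: reshape the flat list a into a matrix with q columns."""
    if q <= 0 or len(a) % q != 0:
        raise ValueError("length of a must be a positive multiple of q")
    return [a[i:i + q] for i in range(0, len(a), q)]


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    a = vec(A)
    print(a)          # flat, row by row
    print(mat(a, 3))  # round trip back to A
```

In the method, mat(·) is applied to a projected connectivity vector of length $|U| \cdot |T|$ with $q = |T|$, so that rows correspond to users and columns to tags.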

In this paper, we consider the problem of learning the community vectors $\pi_i$, for $i \in [n]$, given a
realization of the (matricized) hyper-adjacency matrix $G \in \{0,1\}^{|U| \cdot |T| \times |R|}$. We will employ a clustering-based
approach on the hyper-adjacency matrix, but employ a different clustering criterion than the usual distance
based clustering. Our method is shown in Algorithm 1.

Our method relies on finding pure resource nodes and using them to find communities for the resources,
tags and users. A pure resource node is a node that mainly corresponds to one hidden community.
Therefore, finding such nodes paves the way for finding resource communities. In addition, since this is a
resource-centric model, looking at the subset of the hypergraph with pure resources, all tags and all users
suffices to find the communities for users and tags as well. Since we assume knowledge of the community
connectivity matrices, we can learn community memberships for mixed resource nodes as well. We now
provide the details of our proposed method.

Projection matrix: We partition the resource set $R$ into two parts $X$ and $Y$ to avoid dependency issues
between the projection matrix and the projected vectors, and this is standard for analysis of spectral
clustering. Now let $k$-svd$(G(\{U, T\}, Y)) = M_k \Lambda_k V_k^\top$ and we employ $\text{Proj} := M_k M_k^\top$ as the projection matrix.
We project the vectors $G(\{U, T\}, x)$ for $x \in X$ using this projection matrix.

Rank test on projected vectors: In the usual spectral clustering method, once we have projected vectors
$\text{Proj}\, G(\{U, T\}, x) \in \mathbb{R}^{|U| \cdot |T|}$, any distance based clustering can be employed to classify the vectors into
different (pure) communities. However, when mixed membership nodes are present, this method fails. We
propose an alternative method which considers a rank test on the (matricized form of) the projected vectors.
Specifically, consider the matricized form $\text{mat}(\text{Proj}\, G(\{U, T\}, x)) \in \mathbb{R}^{|U| \times |T|}$ and check whether

$\sigma_1(\text{mat}(\text{Proj}\, G(\{U, T\}, x))) > \tau_1$ and $\sigma_2(\text{mat}(\text{Proj}\, G(\{U, T\}, x))) < \tau_2$,

and if so, declare the node $x \in X$ as a pure node. Interchange the roles of $X$ and $Y$ and similarly find pure
nodes in $Y$.
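The intuition behind the rank test can be checked at the population level. From Eqns. (1)-(2), the expected entry for a triplet is $\mathbb{E}[G(\{u,t\},r) \mid \Pi] = \sum_j \pi_r(j)\,(\pi_u^\top P)_j\,(\pi_t^\top \tilde{P})_j$, so for a pure resource the $|U| \times |T|$ matrix of expectations is an outer product. The Python sketch below verifies this on a toy example by checking that all 2x2 minors vanish. It is not the rank test itself, which thresholds singular values of projected empirical vectors; the function names and the toy numbers are assumptions chosen for illustration.

```python
def expected_connectivity(Pi_U, Pi_T, pi_r, P, P_tilde):
    """Return the |U| x |T| matrix with entries E[G({u,t},r) | memberships].

    Pi_U and Pi_T are lists of membership vectors (one per user / tag);
    pi_r is the membership vector of the resource.
    """
    k = len(pi_r)
    # F[u][j] = (pi_u^T P)_j: chance that user u selects a resource in context j.
    F = [[sum(pu[i] * P[i][j] for i in range(k)) for j in range(k)] for pu in Pi_U]
    # F_tilde[t][j] = (pi_t^T P_tilde)_j: the analogous quantity for tag t.
    F_tilde = [[sum(pt[i] * P_tilde[i][j] for i in range(k)) for j in range(k)] for pt in Pi_T]
    return [[sum(pi_r[j] * fu[j] * ft[j] for j in range(k)) for ft in F_tilde] for fu in F]


def max_abs_minor(M):
    """Largest absolute 2x2 minor of M; it is zero exactly when rank(M) <= 1."""
    best = 0.0
    for a in range(len(M)):
        for b in range(a + 1, len(M)):
            for c in range(len(M[0])):
                for d in range(c + 1, len(M[0])):
                    best = max(best, abs(M[a][c] * M[b][d] - M[a][d] * M[b][c]))
    return best


if __name__ == "__main__":
    P = [[0.8, 0.2], [0.2, 0.8]]  # homogeneous connectivity with p = 0.8, q = 0.2
    users = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    tags = [[1.0, 0.0], [0.0, 1.0]]
    pure = expected_connectivity(users, tags, [1.0, 0.0], P, P)
    mixed = expected_connectivity(users, tags, [0.5, 0.5], P, P)
    print(max_abs_minor(pure), max_abs_minor(mixed))
```

For the mixed resource the minors are bounded away from zero, which is the population analogue of a large second singular value in the rank test.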


Learning using estimated pure nodes: Once the pure nodes in resource set $R$ are found, we can employ
the tensor decomposition method, proposed in (Anandkumar et al., 2014a), for learning the mixed
membership communities of all the nodes. The pure nodes are employed to obtain averaged 3-star subgraph counts.
Partition $\{U, T\}$ into three sets $A$, $B$, $C$ as shown in Figure 2. The 3-star subgraph count is defined as

$\hat{T}_{\hat{R} \to A,B,C} := \frac{1}{|\hat{R}|} \sum_{r \in \hat{R}} G(r, A)^\top \otimes G(r, B)^\top \otimes G(r, C)^\top$, (3)

where $\hat{R}$ denotes the set of pure resource nodes. The method is explained in Appendix B.
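The averaged 3-star count in Eqn. (3) can be sketched as follows. This is a minimal pure-Python illustration, assuming each of `G_A`, `G_B`, `G_C` is a list with one 0/1 indicator row per pure resource, where the columns index the (user, tag) pairs assigned to the parts $A$, $B$ and $C$; the function name `three_star_count` and this data layout are hypothetical.

```python
def three_star_count(G_A, G_B, G_C):
    """Average, over pure resources r, of the outer product G(r,A) x G(r,B) x G(r,C)."""
    m = len(G_A)
    if m == 0 or len(G_B) != m or len(G_C) != m:
        raise ValueError("need the same nonzero number of pure resources in all three inputs")
    na, nb, nc = len(G_A[0]), len(G_B[0]), len(G_C[0])
    T = [[[0.0] * nc for _ in range(nb)] for _ in range(na)]
    for r in range(m):
        for a in range(na):
            if not G_A[r][a]:
                continue  # skip zero entries of a sparse indicator row
            for b in range(nb):
                if not G_B[r][b]:
                    continue
                for c in range(nc):
                    # Entry (a, b, c) counts resources linked to all three leaves.
                    T[a][b][c] += G_A[r][a] * G_B[r][b] * G_C[r][c] / m
    return T


if __name__ == "__main__":
    G_A = [[1, 0], [1, 1]]
    G_B = [[1], [0]]
    G_C = [[0, 1], [1, 1]]
    print(three_star_count(G_A, G_B, G_C))
```

Entry `T[a][b][c]` is the fraction of pure resources that have hyper-edges to all three leaves `a`, `b` and `c`, which matches the description of counting resources common to triplets of (user, tag) tuples.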


Reconstruction after power method: Since we do not have access to the exact moments, we need to do
additional processing: the estimated community membership vectors are subjected to thresholding so that
the weak values are set to zero. This modification makes our reconstruction more robust, as we are considering
sparse community memberships. Also note that assuming knowledge of the community connectivity matrices,
we can learn community memberships for mixed resource nodes as well. This is shown in Algorithm 3 in
the Appendix.
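The thresholding step can be sketched as below. This is a hedged illustration only: the paper's exact rule is given in its appendix, so the comparison used here (keep entries that are at least `tau`, zero out the rest) and the function name `threshold_memberships` are assumptions.

```python
def threshold_memberships(Pi_hat, tau):
    """Zero out estimated membership weights below tau (entry-wise hard thresholding)."""
    return [[x if x >= tau else 0.0 for x in row] for row in Pi_hat]


if __name__ == "__main__":
    Pi_hat = [[0.62, 0.03, 0.35], [0.01, 0.97, 0.02]]
    print(threshold_memberships(Pi_hat, 0.05))
```

Small estimated weights are treated as estimation noise, which is appropriate when the true membership vectors are sparse.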


Algorithm 1 $\{\hat{\Pi}\} \leftarrow$ LearnMixedMembership($G$, $k$, $\tau_1$, $\tau_2$)

Input: Hyper-adjacency matrix $G \in \{0,1\}^{|U| \cdot |T| \times |R|}$, $k$ is the number of communities, and $\tau_1$, $\tau_2$ are thresholds
for rank test.

Output: Estimates of the community membership vectors $\hat{\Pi}$.

1: Partition the resource set $R$ randomly into two parts $X$, $Y$.

2: $\hat{R}$ = Pure Resource Nodes Detection($X$, $Y$, $U$, $T$).

3: $\hat{\Pi} \leftarrow$ TensorDecomp($G(\{U, T\}, \cdot)$, $\hat{R}$)

4: Return $\hat{\Pi}$.


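Procedure 2 Pure Resource Nodes Detection

Input: $X$, $Y$, $U$, $T$.

1: Construct projection matrix $\text{Proj} = M_k M_k^\top$, where $k$-svd$(G(\{U, T\}, Y)) = M_k \Lambda_k V_k^\top$.

2: Set of pure nodes $\hat{R} \leftarrow \emptyset$.

3: for $x \in X$ do

4: if $\sigma_1(\text{mat}(\text{Proj}\, G(\{U, T\}, x))) > \tau_1$ and $\sigma_2(\text{mat}(\text{Proj}\, G(\{U, T\}, x))) < \tau_2$ then

5: $\hat{R} \leftarrow \hat{R} \cup \{x\}$. {Note $\text{mat}(\text{Proj}\, G(\{U, T\}, x)) \in \mathbb{R}^{|U| \times |T|}$ is matricization}

6: end if

7: end for

8: Interchange roles of $X$ and $Y$ and find pure nodes in $Y$.

9: Return $\hat{R}$.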













4 Analysis of the Learning Algorithm 

Notation: Let $\tilde{O}(\cdot)$ denote $O(\cdot)$ up to poly-log factors. We use the term high probability to mean with
probability $1 - n^{-c}$ for any constant $c > 0$.

4.1 Assumptions 

For simplicity, we assume that the community memberships of resources, tags and users are drawn from the
same distribution. Further, we consider equal expected community sizes, i.e. $\mathbb{E}[\pi] = \frac{1}{k} \mathbf{1}$. Additionally,
we assume that the community connectivity matrices $P$, $\tilde{P}$ are homogeneous* and equal:

$P = \tilde{P} = (p - q) I + q \mathbf{1}\mathbf{1}^\top$. (4)

These simplifications are merely for convenience, and can be easily removed.

Requirement for success of rank test: We require that

n = n ^c7fc(E[7r7r’^])“3 • K(E[7r7r’^])"2 • ^ 

where $\kappa(\cdot)$ denotes the condition number and $\sigma_k(\cdot)$ denotes the $k$-th singular value.

We assume that maxjgj;;.] T^x{i) = 1 — e, e = 0(1) and hence there exists no node such that their vr is 
between 1 and TTmax- 


Requirement for success of tensor decomposition: Recall that the tensor method uses only pure
resource nodes. Let $\rho$ be the fraction of such pure resource nodes. Let $w_i := \mathbb{P}[\pi_r(i) = 1 \mid r \in \hat{R}]$. For
simplicity, we assume that $w_i = 1/k$. Again, this can be easily extended.

We require the separation in edge connectivity $p - q$ to satisfy

{P - qf '/k 

P ^fc(lEK7rT]) 

Intuitively this implies that there should be enough separation between connectivity within a community 
and connectivity across communities. 



Dependence on p, q: Note that for the rank test, ([5]), in the well-conditioned setting we have (E[7r7r^]) = 

0{l/k). Then if p ~ q, we need n = Q {k^)- For the case where q < p/k, we will require n = Cl 
This is intuitive as the role of q is to make the components non-orthogonal, i.e., q acts as noise. Therefore, 
smaller q results in better guarantees. For the tensor decomposition method, in the well-conditioned 
setting, if we have n = Cl [k^), this means p, q are constants. Alternatively, for sparse graphs, we want p, q 
to decay. According to the constraints, we then need a larger $n$. This is intuitive since in the case of sparse graphs we
have fewer observations and less information about the unknown community memberships. Therefore, we need
more samples.

Note that Anandkumar et al. (2014a) require n = 0{k‘^) while we need n = 0{k^). The reason is
that we are estimating a hypergraph (they estimate a graph) and we are estimating more parameters in this 
model. Therefore, we need more samples. 


*Our results can be easily extended to the case when $P$ and $\tilde{P}$ are full rank.
$\tilde{\Omega}$, $\tilde{O}$ represent $\Omega$, $O$ up to poly-log factors.

4.2 Guarantees


We now establish the main results on recovery at the end of our algorithm. We first show that under the
assumptions in the previous section, we obtain an $\ell_2$ guarantee for recovery of the membership weights of resource
nodes in each community. We should note that this result can be extended to recovery of memberships for
tag and user nodes as well. In this case, there will be additional perturbation terms.

Let $\hat{\Pi}$ be the reconstruction of communities (of resources, users and tags) using the tensor method in
Algorithm 3 in the Appendix, but before thresholding. For a matrix $M$, let $(M)^i$ denote the $i$-th row. Recall
that $(\Pi)^i$ denotes the memberships of all the nodes in the $i$-th community, since $\Pi \in \mathbb{R}^{k \times n}$. We
have the following result:

Theorem 1 (Reconstruction of communities (before thresholding)) We have w.h.p.

ma. II (ft)' - {Il)% - O (. p) 

i&lk] ^{p-qY ) 

Remark: Note that the $\ell_2$ norm above is taken over all the nodes of the network and we expect this to be
$O(\sqrt{n})$ if the error at each node is $O(1)$. Assuming $\mathbb{E}[\pi\pi^\top]$ is well conditioned and when $\rho, p, q = \Omega(1)$, we
get a better guarantee that $\epsilon_\pi = O(\sqrt{k})$.

Now we further show that when the distribution of $\pi$ is “mostly” sparse, i.e. each node’s membership
vector does not have too many large entries, we can improve the above $\ell_2$ guarantees into $\ell_1$ guarantees via
thresholding.

Specifically, assuming that the distribution of $\pi$ satisfies

P[7r(i) > r] < ^log(l/r), Vi G [/c] 

for T = 0{e-n ■ -), we have the following result. This is equivalent to the case that the tail $\tau$ is exponentially
small in $k$, i.e., sparsity.

Remark: The Dirichlet distribution satisfies this assumption when $\alpha_j < 1$, where $\alpha_j$ represent the
Dirichlet concentration parameters.


Theorem 2 ($\ell_1$ guarantee for reconstruction after thresholding) We have


lit _ nt||^ = o 




~ ( y/n ■ p ■ K{E[Tr7r~^]) \ 

V vpip -<1? y ’ 


where IT* is the result of thresholding with r = 0(ejr • ^)- 


( 8 ) 


Remark: Note that the $\ell_1$ norm above is taken over all the nodes of the network and we expect this to be
$O(n)$ if the error at each node is $O(1)$. Assuming $\mathbb{E}[\pi\pi^\top]$ is well conditioned and when $\rho, p, q = \Omega(1)$, we get
a better guarantee of $O(\sqrt{n})$. Hence, we obtain good error guarantees in both cases, on $\ell_1$ and $\ell_2$ norms.

For proofs of the Theorems, see the Appendix.

5 Overview of Proof


5.1 Analysis of Graph Moments under MMSF 

5.1.1 Overview of Kronecker and Khatri-Rao products: 

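We require the notions of Kronecker $A \otimes B$ and Khatri-Rao $A \odot B$ products between two matrices $A$ and
$B$. First we define the Kronecker product $A \otimes B$ between matrices $A \in \mathbb{R}^{n_1 \times k_1}$ and $B \in \mathbb{R}^{n_2 \times k_2}$. Its $(\mathbf{i}, \mathbf{j})$-th
entry is given by

$(A \otimes B)_{\mathbf{i},\mathbf{j}} := A_{i_1,j_1} B_{i_2,j_2}, \quad \mathbf{i} = \{i_1, i_2\} \in [n_1] \times [n_2], \ \mathbf{j} = \{j_1, j_2\} \in [k_1] \times [k_2].$

Thus, for two vectors $a$ and $b$, we have

$(a \otimes b)_{\mathbf{i}} := a_{i_1} b_{i_2}, \quad \mathbf{i} = \{i_1, i_2\} \in [n_1] \times [n_2].$

For the Khatri-Rao product $A \odot B$ between matrices $A \in \mathbb{R}^{n_1 \times k}$ and $B \in \mathbb{R}^{n_2 \times k}$, we have its $(\mathbf{i}, j)$-th entry as

$(A \odot B)(\mathbf{i}, j) := A_{i_1,j} B_{i_2,j}, \quad \mathbf{i} = \{i_1, i_2\} \in [n_1] \times [n_2], \ j \in [k].$

In other words, we have

$A \odot B := [a_1 \otimes b_1 \ \ a_2 \otimes b_2 \ \ \ldots \ \ a_k \otimes b_k],$

where $a_i, b_i$ are the $i$-th columns of $A$ and $B$. Note the difference between the Kronecker and the Khatri-Rao
products. While the Kronecker product expands both the number of rows and columns, the Khatri-Rao
product preserves the original number of columns. We will also use another simple fact that

$(A \otimes B)(C \odot D) = AC \odot BD. \quad (9)$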
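The two products and the mixed-product fact can be checked on small integer matrices. The sketch below is only an illustration in plain Python; the helper names `kron`, `khatri_rao` and `matmul` and the example matrices are assumptions made here, not part of the paper.

```python
def matmul(A, B):
    """Ordinary matrix product of two list-of-rows matrices."""
    return [[sum(A[i][l] * B[l][j] for l in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def kron(A, B):
    """Kronecker product: (n1*n2) x (k1*k2), rows indexed by (i1, i2)."""
    return [[A[i1][j1] * B[i2][j2]
             for j1 in range(len(A[0])) for j2 in range(len(B[0]))]
            for i1 in range(len(A)) for i2 in range(len(B))]


def khatri_rao(A, B):
    """Column-wise Kronecker product: (n1*n2) x k."""
    k = len(A[0])
    assert len(B[0]) == k, "both factors need the same number of columns"
    return [[A[i1][j] * B[i2][j] for j in range(k)]
            for i1 in range(len(A)) for i2 in range(len(B))]


A = [[1, 2], [3, 4]]            # 2 x 2
B = [[0, 1], [1, 1], [2, 0]]    # 3 x 2
C = [[1, 0], [2, 1]]            # 2 x 2
D = [[1, 1], [0, 2]]            # 2 x 2

lhs = matmul(kron(A, B), khatri_rao(C, D))      # (A kron B)(C khatri-rao D)
rhs = khatri_rao(matmul(A, C), matmul(B, D))    # AC khatri-rao BD
```

Here `kron(A, B)` has 6 rows and 4 columns, while `khatri_rao(A, B)` has 6 rows and only 2 columns, which is the difference in column count noted above.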

5.1.2 Result on Correctness of the Algorithm 

Recall that $P \in [0,1]^{k \times k}$ denotes the connectivity matrix between communities of users and resources and
$\tilde{P} \in [0,1]^{k \times k}$ denotes the corresponding connectivity between communities of resources and tags. Define

$F := \Pi_U^\top P, \quad \tilde{F} := \Pi_T^\top \tilde{P}. \quad (10)$

Let $F_u = \pi_u^\top P$ be the row vector corresponding to user $u$ and similarly $\tilde{F}_t$ corresponds to tag $t$. Similarly,
let $F_A = \Pi_A^\top P$ be the sub-matrix of $F$.

We now provide a simple result on the average hyper-edge connectivity and the form of the 3-star counts,
given the community memberships.

Proposition 1 (Form of Graph Moments) Under the MMSF model proposed in Section 2, we have that
the generated hyper-graph $G \in \mathbb{R}^{|U| \cdot |T| \times |R|}$ satisfies

$\bar{G} := \mathbb{E}[G \mid \Pi] = (F \odot \tilde{F}) \Pi_R, \quad (11)$

where $\odot$ denotes the Khatri-Rao product. Moreover, for a given resource $r \in R$, the column vector
$G(r, \{U,T\})$ has conditionally independent entries given the community membership vector $\pi_r$. If $\hat{R} \subset R$
is the set of (exactly) pure nodes, then the 3-star count defined in (3) satisfies

$\mathbb{E}[T_{\hat{R} \to A,B,C} \mid \Pi_{A,B,C}] = \sum_{i \in [k]} w_i \, (H_A)_i \otimes (H_B)_i \otimes (H_C)_i, \quad (12)$

where $w_i$ is

$w_i := \mathbb{P}[\pi_r(i) = 1 \mid r \in \hat{R}],$

and $H_A := F_{U(A)} \odot \tilde{F}_{T(A)}$; similarly, $H_B$ and $H_C$.

The above results follow from modeling assumptions in Section 2 and in particular, the conditional
independence relationships among the different variables. For details, see Appendix A.

In (11), note that if a column of $\bar{G}(X; \{U,T\})$ corresponds to a pure node $x \in X$, then the matrix
obtained by reshaping that column to size $|U| \times |T|$ has rank one, since $\pi_x$ corresponds to a coordinate
basis vector. On the other hand, for the case where columns correspond to mixed nodes, the matrix has rank
bigger than one. Thus, the rank criterion succeeds in identifying the pure nodes in $X$ under exact moments.
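The rank criterion can be illustrated on exact moments. The sketch below assumes the reshaped column takes the form $F_U \mathrm{Diag}(\pi_x) \tilde{F}_T^\top$ (as recalled in Appendix C.3) and tests rank one by checking that all 2 x 2 minors vanish. The matrices, the helper names and the tolerance are hypothetical choices for illustration only.

```python
def moment_matrix(F_U, F_T, pi):
    """Return F_U Diag(pi) F_T^T, a |U| x |T| matrix."""
    k = len(pi)
    return [[sum(F_U[u][i] * pi[i] * F_T[t][i] for i in range(k))
             for t in range(len(F_T))] for u in range(len(F_U))]


def is_rank_one(M, tol=1e-12):
    """True if M is nonzero and every 2x2 minor vanishes (up to tol)."""
    rows, cols = len(M), len(M[0])
    nonzero = any(abs(v) > tol for row in M for v in row)
    minors_vanish = all(
        abs(M[a][c] * M[b][d] - M[a][d] * M[b][c]) <= tol
        for a in range(rows) for b in range(a + 1, rows)
        for c in range(cols) for d in range(c + 1, cols))
    return nonzero and minors_vanish


# Hypothetical connectivity profiles: rows are users (or tags), columns are communities.
F_U = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
F_T = [[0.7, 0.2], [0.1, 0.6]]

pure = moment_matrix(F_U, F_T, [1.0, 0.0])    # pure node in community 0
mixed = moment_matrix(F_U, F_T, [0.5, 0.5])   # evenly mixed node
```

With exact moments the pure node yields an outer product of a single column pair, while the mixed node combines two such outer products and fails the test. With sampled data the exact test is replaced by the thresholded singular-value test analysed in Section 5.3.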


Lemma 3 (Correctness of the method under exact moments) Assume $F \odot \tilde{F}$ has full column rank, and
$\Pi_Y$ has full row rank, where $Y \subset R$ is used for constructing the projection matrix, then the proposed method
LearnMixedMembership in Algorithm 1 correctly learns the community membership matrix $\Pi$.


Proof: Using the form of the moments in Proposition 1, we have that if $r \in R$ is a pure node, then
$\bar{G}(r; \{U,T\}) = (F \odot \tilde{F})\pi_r$ is rank one since it selects only one column of $F \odot \tilde{F}$. Thus, the rank
test in Algorithm 1 succeeds in recovering the pure nodes. The correctness of tensor method follows
from (Anandkumar et al., 2014a). □

Since we only have sampled graph G and not the exact moments, we need to carry out perturbation 
analysis, which is outlined below. 


5.2 Perturbation Analysis 

Recall that $\widehat{\mathrm{Proj}} = \hat{U}_k \hat{U}_k^\top$ is the projection matrix corresponding to $k$-svd$(\hat{G}(\{U,T\},Y)) = \hat{U}_k \hat{\Lambda}_k \hat{V}_k^\top$.

Define the perturbation between empirical and exact moments upon projection as

$m_x := \|\widehat{\mathrm{Proj}}\,\hat{G}(\{U,T\},x) - \bar{G}(\{U,T\},x)\|, \ \forall x \in X, \quad \epsilon_{\mathrm{Rank}} := \max_x \|m_x\|. \quad (13)$

The above perturbation can be divided into two parts

$\|m_x\| \le \|\widehat{\mathrm{Proj}}(\hat{G}(\{U,T\},x) - \bar{G}(\{U,T\},x))\| + \|(\widehat{\mathrm{Proj}} - \mathrm{Proj})\bar{G}(\{U,T\},x)\|.$

The first term is commonly referred to as distance perturbation and the second term is the subspace
perturbation. We establish these perturbation bounds below.
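The split can be derived in one step, under the assumption (implied by the conditions of Lemma 3, where $\Pi_Y$ has full row rank) that the exact moment $\bar{G}(\{U,T\},x) = (F \odot \tilde{F})\pi_x$ lies in the range of the exact projection, so that $\mathrm{Proj}\,\bar{G}(\{U,T\},x) = \bar{G}(\{U,T\},x)$. The following is a sketch of that step, writing $\hat{G}_x$ and $\bar{G}_x$ as shorthand for the two columns:

```latex
\begin{align*}
\widehat{\mathrm{Proj}}\,\hat{G}_x - \bar{G}_x
  &= \widehat{\mathrm{Proj}}\,(\hat{G}_x - \bar{G}_x)
     + \widehat{\mathrm{Proj}}\,\bar{G}_x - \mathrm{Proj}\,\bar{G}_x \\
  &= \widehat{\mathrm{Proj}}\,(\hat{G}_x - \bar{G}_x)
     + (\widehat{\mathrm{Proj}} - \mathrm{Proj})\,\bar{G}_x .
\end{align*}
```

Taking norms and applying the triangle inequality gives the two-term bound, with the first term measuring noise in the sampled column and the second measuring error in the estimated subspace.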

We begin our perturbation analysis by bounding $m_x$ as defined in Eqn. (13).

Lemma 4 (Distance perturbation) Under the assumptions of Section 4.1, with probability $1 - \delta$, we have
for all $x \in X$,


Pmi{G{{U, T}, x) - G{{U, T}, x))|| < Vkp 


1 + ^ (log(n/5))^ 


1/2 


for some constant $C' > 0$.


See Appendix C.1 and Appendix C.2 for details. Notice that the subspace perturbation dominates.
Lemma 5 (Subspace perturbation) We have the subspace perturbation as


||(Proj-Proj)G({C/,r},x)|| < 2a-\liY)^J\\F Q F\\^. 
Under the assumptions of Section 4.1, w.h.p. this reduces as

See Appendix C.2.


5.3 Analysis of Rank Test 

Recall that from the perturbation analysis, we have bound $\epsilon_{\mathrm{Rank}}$ on the error vector $m_x$, defined in (13). We
assume there exists no node such that $\max_{i \in [k]} \pi_x(i)$ is between the threshold given in (14) and 1. We have
the following result on the rank test.


Lemma 6 (Conditions for Success of Rank Test) When the thresholds in Algorithm 1 are chosen

$0 < \tau_1 < \min_i \|(F_U)_i\| \cdot \|(\tilde{F}_T)_i\| - \epsilon_{\mathrm{Rank}}, \quad \tau_2 > \epsilon_{\mathrm{Rank}},$

then all the pure nodes pass the rank test. Moreover, any node $x \in X$ passing the rank test satisfies

$\max_{i \in [k]} \pi_x(i) \ge \frac{\tau_1 - \tau_2 - 2\epsilon_{\mathrm{Rank}}}{\max_i \|(F_U)_i\| \cdot \|(\tilde{F}_T)_i\|}. \quad (14)$


Proof: See Appendix C.3. □

The above result states that we can correctly detect pure nodes using the rank test. The conditions stem
from the fact that we require the top eigen-value to pass the test and the second top eigen-value to not
pass the test. For a pure node, $\sigma_1(\mathrm{mat}(\mathrm{Proj}\,\bar{G}(\{U_1,T_1\},x)))$ is at least $\min_i \|(F_{U_1})_i\| \cdot \|(\tilde{F}_{T_1})_i\|$. To account for
empirical error, we consider $\epsilon_{\mathrm{Rank}}$. In addition, the second-top eigen-value can be as small as 0. We also
note the error in empirical estimation. This result allows us to control the perturbation in the 3-star tensor
constructed using the nodes which passed the rank test.


6 Conclusion 

In this paper, we propose a novel probabilistic approach for modeling folksonomies, and propose a guaranteed
approach for detecting overlapping communities in them. We present a more scalable approach where
realistic conditional independence constraints are imposed. These constraints are natural for social tagging
systems, and they lead to scalable modeling and tractable learning. While the original MMSB model assumes
that the communities are drawn from a Dirichlet distribution, here, we do not require such a strong
parametric assumption. Note that the Dirichlet assumption for community memberships can be limiting
and cannot model general correlations in memberships. Here, we impose a weak assumption that a certain
fraction of resource nodes are “pure” and belong to a single community. This is reasonable to expect in practice.
We establish that the communities are identifiable under these natural assumptions, and can be learnt
efficiently using spectral approaches. Considering future directions, we note that social tagging assumes a
specific structure. Therefore, it is of interest to extend this model to more general hypergraphs.
Appendix

A Moments under MMSF model and Algorithm Correctness 

Proof of Proposition 1: We have

E[G({n, t], r)|7r^, TTt, 7r„] = E[E[G({u, t] ,r)\zr^{u,t}-,'^t-,T^r,T^u]] 

(c) "" 

= '^\FuZr^{u,t} ■ (15) 

where (a) and (5) are from the assumption © that 


where Br^u\t and Br^f,u are Bernoulli draws, which only depend on the contextual variables ^u^{r,t} 

and and therefore G{{u,t},r) — form a Markov chain. This also establishes that 

G{{u,t},r) and G({w',f'},r) are conditionally independent given the community membership vector tt^, 
for u ^ u' and t / t'. 

For (c), we have that 


E [Bf—^u\t I ^r—^{u,t} 


) TTn] = '^11— 


from ([ill and the fact that 






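Thus, we have

$\mathbb{E}[G(\{U,T\},r) \mid \pi_r, \Pi_T, \Pi_U] \overset{(a)}{=} \mathbb{E}[F z_{r \to \{u,t\}} \otimes \tilde{F} z_{r \to \{u,t\}} \mid \pi_r]$

$\overset{(b)}{=} \mathbb{E}[(F \otimes \tilde{F})(z_{r \to \{u,t\}} \otimes z_{r \to \{u,t\}}) \mid \pi_r]$

$\overset{(c)}{=} \sum_{i \in [k]} \pi_r(i)\,(F \otimes \tilde{F})(e_i \otimes e_i)$

$\overset{(d)}{=} (F \odot \tilde{F})\pi_r,$

where (a) follows from (15) and (b) follows from the fact (9). (c) follows from the fact that $z_{r \to \{u,t\}}$ takes
value $e_i$ with probability $\pi_r(i)$, where $e_i \in \mathbb{R}^k$ is the basis vector in the $i$-th coordinate, (d) follows from the
definition of Khatri-Rao product.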
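Steps (c) and (d) can be checked numerically for a single (user, tag) pair by averaging over the community indicator. The sketch below assumes, as the derivation does, that given the indicator $e_i$ the two Bernoulli draws are independent with success probabilities $F_u(i)$ and $\tilde{F}_t(i)$. All numbers and helper names are hypothetical.

```python
def expected_hyperedge(F_u, Ft_t, pi_r):
    """Average over the community z = e_i, drawn with probability pi_r[i].

    Given z = e_i, the product of the two independent Bernoulli draws
    has mean F_u[i] * Ft_t[i].
    """
    return sum(p * F_u[i] * Ft_t[i] for i, p in enumerate(pi_r))


def khatri_rao_times(F, Ft, pi):
    """Compute (F khatri-rao Ft) pi entry by entry, rows ordered by (u, t)."""
    return [sum(F[u][i] * Ft[t][i] * pi[i] for i in range(len(pi)))
            for u in range(len(F)) for t in range(len(Ft))]


F = [[0.9, 0.1], [0.2, 0.8]]     # rows: users, columns: communities
Ft = [[0.7, 0.2], [0.1, 0.6]]    # rows: tags, columns: communities
pi_r = [0.25, 0.75]              # membership vector of resource r

column = khatri_rao_times(F, Ft, pi_r)
entry = expected_hyperedge(F[1], Ft[0], pi_r)   # user 1, tag 0
```

The entry for user 1 and tag 0 sits at position `1 * len(Ft) + 0 = 2` of `column`, and both routes give the same value.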

The form of the 3-star moment is from the lines of (Anandkumar et al., 2014a, Prop 2.1), and relies on
the assumption that $\hat{R}$ consists of pure nodes.

□ 


B Learning using Tensor Decomposition 


We now recap the tensor decomposition approach proposed in (Anandkumar et al., 2014a) here. This is
shown in Algorithm 3 with modifications specific to our framework.

We partition $U, T$ into three sets for the different tasks explained in the Algorithm 3. Also note that with
knowledge of community connectivity matrices, we can learn community memberships for mixed resource
nodes as well.


Procedure 3 $\hat{\Pi} \leftarrow$ TensorDecomp($G$, $\hat{R}$)

Let $P \in \mathbb{R}^{k \times k}$ be the community connectivity matrix from user communities to resource communities
and similarly $\tilde{P}$ is connectivity from tag communities to resource communities. $\hat{R}$ are estimated pure
resource nodes. Partition $\{U, T\}$ into $\{U_i, T_i\}$ for $i = 1, 2, 3$.

Compute whitened and symmetrized tensor $T \leftarrow \hat{T}_{\hat{R} \to \{A,B,C\}}(\hat{W}_A, \hat{W}_B \hat{S}_{AB}, \hat{W}_C \hat{S}_{AC})$, where $A, B, C$
form a partition of $\{U_2, T_2\}$. Use $\{U_3, T_3\}$ for computing the whitening matrices.

{A, <!>} ■(—TensorEigen(T, {Wj Ff). {^is akxk matrix with each columns being an estimated 

eigenvector and A is the vector of estimated eigenvalues.} 

^ Thres(Diag(A)-i8^fUjGj ,4 , r). 
return (If). 


C Perturbation Analysis: Proof of Theorems 1 and 2

Notation: For a vector $v$, let $\|v\|$ denote its 2-norm. Let $\mathrm{Diag}(v)$ denote a diagonal matrix with diagonal
entries given by a vector $v$. For a matrix $M$, let $(M)_i$ and $(M)^i$ denote its $i$-th column and row respectively.
Let $\|M\|_1$ denote column absolute sum and $\|M\|_\infty$ denote row absolute sum of $M$. Let $M^\dagger$ denote the
Moore-Penrose pseudo-inverse of $M$.













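Procedure 4 $\{\lambda, \Phi\} \leftarrow$ TensorEigen($T$, $\{v_i\}_{i \in [L]}$, $N$) (Anandkumar et al., 2014a)

Input: Tensor $T \in \mathbb{R}^{k \times k \times k}$, $L$ initialization vectors $\{v_i\}_{i \in [L]}$, number of iterations $N$.

Output: the estimated eigenvalue/eigenvector pairs $\{\lambda, \Phi\}$, where $\lambda$ is the vector of eigenvalues and $\Phi$ is
the matrix of eigenvectors.
for $i = 1$ to $k$ do
    for $\tau = 1$ to $L$ do
        $\theta_0^{(\tau)} \leftarrow v_\tau$.
        for $t = 1$ to $N$ do
            $\tilde{T} \leftarrow T$.
            for $j = 1$ to $i - 1$ (when $i > 1$) do
                if $|\lambda_j \langle \theta_t^{(\tau)}, \phi_j \rangle| > \xi$ then
                    $\tilde{T} \leftarrow \tilde{T} - \lambda_j \phi_j^{\otimes 3}$.
                end if
            end for
            Compute power iteration update $\theta_t^{(\tau)} := \tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)}) / \|\tilde{T}(I, \theta_{t-1}^{(\tau)}, \theta_{t-1}^{(\tau)})\|$.
        end for
    end for
    Let $\tau^* := \arg\max_{\tau \in [L]} \{\tilde{T}(\theta_N^{(\tau)}, \theta_N^{(\tau)}, \theta_N^{(\tau)})\}$.
    Do $N$ power iteration updates starting from $\theta_N^{(\tau^*)}$ to obtain eigenvector estimate $\phi_i$, and set $\lambda_i := \tilde{T}(\phi_i, \phi_i, \phi_i)$.
end for
return the estimated eigenvalue/eigenvectors $(\lambda, \Phi)$.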
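The following is a simplified sketch of the tensor power method in plain Python, intended only to illustrate the update $\theta \leftarrow T(I, \theta, \theta) / \|T(I, \theta, \theta)\|$ on a toy orthogonal tensor. It is not the procedure above: deflation is done explicitly after each recovered pair instead of being applied conditionally with the threshold $\xi$, and the function names, the initial vectors and the toy tensor are assumptions made here.

```python
import math


def apply_tensor(T, x):
    """Return the vector T(I, x, x) for a k x k x k tensor T."""
    k = len(x)
    return [sum(T[a][b][c] * x[b] * x[c] for b in range(k) for c in range(k))
            for a in range(k)]


def normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def tensor_value(T, x):
    """Return the scalar T(x, x, x)."""
    return sum(a * b for a, b in zip(apply_tensor(T, x), x))


def power_iterate(T, x, n_iter):
    """Run n_iter normalized power updates starting from x."""
    for _ in range(n_iter):
        x = normalize(apply_tensor(T, x))
    return x


def tensor_eigen(T, inits, n_iter):
    """Simplified power method with explicit deflation (no xi test)."""
    k = len(T)
    T = [[row[:] for row in mat] for mat in T]   # work on a copy
    lams, phis = [], []
    for _ in range(k):
        # Try every initialization and keep the one with the largest T(x, x, x).
        candidates = [power_iterate(T, normalize(v), n_iter) for v in inits]
        phi = max(candidates, key=lambda x: tensor_value(T, x))
        phi = power_iterate(T, phi, n_iter)
        lam = tensor_value(T, phi)
        lams.append(lam)
        phis.append(phi)
        # Deflate: subtract lam times the rank-one tensor phi (x) phi (x) phi.
        for a in range(k):
            for b in range(k):
                for c in range(k):
                    T[a][b][c] -= lam * phi[a] * phi[b] * phi[c]
    return lams, phis


# Toy orthogonal tensor: 2 * e1 (x) e1 (x) e1 + 1 * e2 (x) e2 (x) e2.
T = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
T[0][0][0] = 2.0
T[1][1][1] = 1.0
lams, phis = tensor_eigen(T, inits=[[0.6, 0.8], [0.8, 0.6]], n_iter=20)
```

On this toy input the first recovered pair is the component with weight 2 along the first coordinate, and after deflation the second pair is the component with weight 1 along the second coordinate.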


C.1 Distance Concentration: Proof of Lemma 4


The proof is along the lines of (McSherry, 2001, Theorem 13) but we apply Hanson-Wright bound in
Proposition 5 to get a better perturbation guarantee without the need for constructing the so-called combinatorial
projection, as in (McSherry, 2001).

We have $h_x := \hat{G}(x; \{U,T\}) - \bar{G}(x; \{U,T\})$ and let $\sigma^2 = \max_i \mathbb{E}[h_x(i)^2 \mid \pi_x]$. Note the simple fact


$\|\mathrm{Proj}\, h_x\|^2 = h_x^\top \mathrm{Proj}^2\, h_x = h_x^\top \mathrm{Proj}\, h_x,$

since Proj is a projection matrix. From Proposition 1, we have that the entries of $h_x$ are conditionally
independent given $\pi_x$. Thus, the Hanson-Wright inequality in Proposition 5 is applicable, and we have with
probability $1 - \delta$, for all $x \in X$,

h~l Proj hx < E[/iJ Proj hx\TTx\ + G'a^W Proj ||f (log(n/5))^ (16) 

Now $\|\mathrm{Proj}\|_F \le \sqrt{k}\,\|\mathrm{Proj}\| = \sqrt{k}$. The expectation is

$\mathbb{E}[h_x^\top \mathrm{Proj}\, h_x \mid \pi_x] \le \mathrm{tr}(\mathrm{Proj})\,\sigma^2 = k\sigma^2,$

using the property that Proj is idempotent. Thus, we have from (16), with probability $1 - \delta$, for all $x \in X$,

/ij Proj hx < ka‘^ + G'Vka"^ (log(n/5))^ , 
and we see that the mean term dominates and the bound is $O(k\sigma^2)$.
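The two facts about the projection used above, $\mathrm{tr}(\mathrm{Proj}) = k$ and $\|\mathrm{Proj}\|_F \le \sqrt{k}$, can be checked on a small example. The sketch below builds a rank-2 orthogonal projection in $\mathbb{R}^4$ from a hypothetical orthonormal basis. For an orthogonal projection the Frobenius bound holds with equality, since $\|\mathrm{Proj}\|_F^2 = \mathrm{tr}(\mathrm{Proj}^\top \mathrm{Proj}) = \mathrm{tr}(\mathrm{Proj})$.

```python
import math


def projection_from_orthonormal(cols):
    """Return Proj = sum_j u_j u_j^T for orthonormal vectors u_j."""
    n = len(cols[0])
    return [[sum(u[a] * u[b] for u in cols) for b in range(n)]
            for a in range(n)]


s = 1 / math.sqrt(2)
basis = [[s, s, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]   # k = 2 orthonormal vectors
P = projection_from_orthonormal(basis)

k = len(basis)
n = len(P)
trace = sum(P[a][a] for a in range(n))
frobenius = math.sqrt(sum(v * v for row in P for v in row))

# Idempotence: P times P equals P up to rounding error.
P2 = [[sum(P[a][l] * P[l][b] for l in range(n)) for b in range(n)]
      for a in range(n)]
max_gap = max(abs(P2[a][b] - P[a][b]) for a in range(n) for b in range(n))
```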

Draw random variables 

~ Bernoulli 

Br^f^u ~ Bernoulli(2;jl^^^_^jPz^^{„_t}). 

The presence of hyper-edge G{{u, t},r) is given by the product 

G*({u,f},r) — ■ Bj-—^f-u. 

The variance is on lines of proof of Lemma [TOl and we repeat it here. 

maxE[/ia;(i) {t^x] — max ^\^Bx—^u;tBx—^t-,u ((P© B'jTTx^ut] 

I ueu,v€V 

< max {{F Q F)7:x)ut, 

uGU^vGV 

fe[fc] 

- Yl P{hj)P{i,j)'^xij) 

jG[fe] 

< 

— max 

C.2 Proof of Lemma 5

From Davis-Kahan in Proposition 6, we have

$\|(\widehat{\mathrm{Proj}} - I)\bar{G}(\{U,T\},Y)\| \le 2\|\hat{G}(\{U,T\},Y) - \bar{G}(\{U,T\},Y)\|,$

and thus

$\|(\widehat{\mathrm{Proj}} - I)\bar{G}(\{U,T\},x)\| \le 2\|\hat{G}(\{U,T\},Y) - \bar{G}(\{U,T\},Y)\| \cdot \|\bar{G}(\{U,T\},Y)^\dagger \cdot \bar{G}(\{U,T\},x)\|.$

Now,

$\bar{G}(\{U,T\},Y)^\dagger = ((F \odot \tilde{F})\Pi_Y)^\dagger = \Pi_Y^\dagger (F \odot \tilde{F})^\dagger,$

since the assumption is that $F \odot \tilde{F}$ has full column rank and $\Pi_Y$ has full row rank. Thus, we have

$\bar{G}(\{U,T\},Y)^\dagger \cdot \bar{G}(\{U,T\},x) = \Pi_Y^\dagger (F \odot \tilde{F})^\dagger (F \odot \tilde{F})\pi_x = \Pi_Y^\dagger \cdot \pi_x,$

since $(F \odot \tilde{F})^\dagger (F \odot \tilde{F}) = I$ due to full column rank, when $|U|$ and $|T|$ are sufficiently large, due to
concentration result from Lemma 11. Note that under assumption A3, the variance terms in Lemma 11 are
decaying and we have that $F \odot \tilde{F}$ has full column rank w.h.p. From Lemma 10, we have the result.

C.3 Analysis of Rank Test: Lemma 6

Consider the test under expected moments $\bar{G} := \mathbb{E}[G \mid \Pi]$. For every node $x \in X$ ($R$ is randomly partitioned
into $X$, $Y$), which passes the rank test in Algorithm 1 by definition,

$\|\mathrm{mat}(\bar{G}(\{U,T\},x))\| > \tau_1$, and $\sigma_2(\mathrm{mat}(\bar{G}(\{U,T\},x))) < \tau_2$.
We use the following approximation.


||Fi|| ~ - g)2||n*P + nq^ + 2 {p - g')g||n*||i 

Recall the form of $\bar{G}$ from Proposition 1:

$\mathrm{mat}(\bar{G}(\{U,T\},x)) = F_U\, \mathrm{Diag}(\pi_x)\, \tilde{F}_T^\top.$

First we consider the case, p q. Following lines of Anandkumar et aLl ( 2014h l. we have that 

1*71 '^max^ (—-1“ 9)^1 ^ l|-®ll + fRank 


where 


(V + '?) I 


(^+^) /■ 


Hence, we have that 


<72 > 7r2,maxn(^^-^ + qf - ||F^|| - CRank “(l/A) fRank -VTs^maxnp^HEfTTyr"^] 


where we assume Tr^ax > (1 + F)7r2,max and p := — 11^ ^ 

We note that eRank dominates ||S|| and the last term. Therefore, 


\\E\\ 


7r2,maxn(^-ir^+g)^ ' 


T2 - eRank > Cr 2 {mat{G{{U,T}, x))) > Vr2,maxR(^^-^ + q)'^ - (1 + 1/f) eRank, 


and 


ri + eRank <\\Fu Diag(7r3;)F^|| 

^ "^max max||(F(7j)i|| • ||(.FT)i|| + ^^2 ,max^ i^ + qf 

l rC 

^ '^max max\\{Fu)i\\ ■ ||(FT)i|| + T 2 + l//ieRank- 

I 

Combining we have that any vector which passes the rank test satisfies 

. Ti - T2 + (1 - l/jl) eRank 

TT-rrinx ^ " • 

- max,||(Ff;),||-||(FT),|| 

Now, for the case where q < p/k, the bound on ||F^|| is almost 0, //r ~ 1 and //r = 0. Hence Eqn. (1C.31 ) 
always holds. This is intuitive as the role of q is to make the components non-orthogonal, i.e., q acts as 
noise. Therefore, smaller q results in better guarantees. 

With $|U| = |T| = \Theta(n)$, and using the concentration bounds in Lemma 11, we have that with probability

$1 - \delta$,

\\{FuU • ||(Ft)*|| = O (vW^||E[7r7r^]|| ■ {p - q + Vkq)) 
assuming homogeneous setting.
For $\epsilon_{\mathrm{Rank}}$, the subspace perturbation dominates. From Lemma 11, we have

Thus, we have the subspace perturbation from Lemma 5 as

° ' (V+")) ■ 

Substituting for the condition that $\tau_1 = \Omega(\epsilon_{\mathrm{Rank}})$, we obtain assumption (5). Thus, the rank test succeeds in
this setting.



C.4 Perturbation Analysis for the Tensor Method 


This is along the lines of analysis in (Anandkumar et al., 2014a). However, notice here due to hypergraph
setting, we need to redo the individual perturbations. Recall that $w_i := \mathbb{P}[i = \arg\max_j \pi(j) \mid \pi \text{ is pure}]$ and
$\rho = \mathbb{P}[\pi \text{ is pure}]$. The size of recovered set of pure nodes $|\hat{R}| = \Theta(n\rho)$, assuming $n\rho > 1$.

We provide the perturbation of the whitened tensor. Let := WJ^Ha Diag(??)^/^ be the eigenvectors 
of the whitened tensor under exact moments and A := Diag(r/)~^/^ be the eigenvalues. S, S respectively 
denote the exact and empirical symmetrization matrix for different cases based on their subscript. 

Lemma 7 (Perturbation of whitened tensor) We have w.h.p.


er ■■= 


i£[k] 


=0 


(- 


p 


\y/npwrain ' {p “ ' o-fc(E[vr7rT]) 

Proof: Let T := E[TlnA,B,c]- 

ei := \\f{WA, WbSab, WcSac) - T{Wa, WbSab, WcSac) 
€2 := WnWA, WbSab, WcSac) - UWa, WbSab, WcSac) 

For ei, the dominant term in the perturbation bound is 




iCY 


= O 


1 1 

Wmin 1.R1 


Y,iwJ{GA,i-HATTi 


The second term is 




€2 < 


ew 


V^min ’ 


(17) 
since due to whitening property.


Now imposing the requirement that 

Cj < 0 (Aminr ) , 



and using LemmaHl we have 


ety < 



< 1, 


{P - ^ V't^max 

P 'U^min 


1 

y/np • (Tfc(E[7r7rT]) ■ 


□ 


Lemma 8 (Whitening Perturbation) We have the perturbation of the whitening matrix Wa as w.h.p. 

P 


ew := II Biagiw^/^HjiWA - WAi)|| = o(- 

\' 


^y/npwmin ■ {p - qY ■ o-fc(E[7r7rT])y ■ 

Proof: From (Anandkumar et al., 2014a, Lemma 17), the whitening perturbation under the tensor method
is given by


:= II VPmgiwf/^HliWA - VFa)|| = O 

Using the bounds from Section ICTSl we have 


ec 


a-rnmiG 


ec := \\G{{U,T},R) - G{{U,T},R)\\ = 0{G\\FQF\f) = 0{n 


p-q 

k 


+ q 


and 


a'mm{G ^ l-^l^min ' <7min 

= U [y/n ■ pWann ■ <7min(-f^yl)) • 

From Lemma [13 we have 

a^iniHA) = CFmmiFA © Fa) = 9. [ n{p - qf min (Efvrf] - ElvTiTr^]) ) . 

V *47^* / 

Finally note that fTfc(E[7r7r~'']) = 0 (minj^yj (lE[vr?] — E[7rj7rj])). Substituting we have the result. □ 

Let $\hat{\Pi}_Z$ be the reconstruction after the tensor method (before thresholding) on resource subset $Z \subset R - \hat{R}$
(we do not incorporate $\hat{R}$ to avoid dependency issues), i.e.


fiz := Diag(A)-^4>^lUjG^ 




Lemma 9 (Reconstruction of communities (before thresholding)) We have w.h.p. 


:= max IKlIz)* - (nz)*|| = -^||nz|| = O ( • Vn||E[7r7r ' ]|| ) . 


i^Z 


Vk 


er 


Ti 


'/k 


Proof: This is on lines of (Anandkumar et al., 2014a, Lemma 13).


(18) 

□ 
C.5 Concentration of Graph Moments


Lemma 10 (Concentration of hyper-edges) With probability $1 - \delta$, given community membership vectors
$\Pi$,

$\epsilon_G := \|\hat{G}(\{U,T\},Y) - \bar{G}(\{U,T\},Y)\| = O\left(\max\left(\sqrt{\|F \odot \tilde{F}\|_1}, \sqrt{\|(P * \tilde{P})\Pi_Y\|_\infty}\right)\right)$


Remark: When number of nodes $n$ is large enough, the first term, viz., $\sqrt{\|F \odot \tilde{F}\|_1}$ dominates.


Proof: The proof is on the lines of (Anandkumar et al., 2013, Lemma 22) but adapted to the setting of
hyper-adjacency rather than adjacency matrices. Let $m_y := \hat{G}(\{U,T\},y) - \bar{G}(\{U,T\},y)$ and $M_y := m_y e_y^\top$,
and thus

$\hat{G}(\{U,T\},Y) - \bar{G}(\{U,T\},Y) = \sum_{y} M_y.$

Note that the random matrices $M_y$ are conditionally independent for $y \in Y$ since $m_y$ are conditionally
independent given $\pi_y$, and in each vector $m_y$, the entries are independent as well. We apply matrix Bernstein’s
inequality. We have $\mathbb{E}[M_y \mid \Pi] = 0$. We compute the variances $\sum_y \mathbb{E}[M_y M_y^\top \mid \Pi]$ and $\sum_y \mathbb{E}[M_y^\top M_y \mid \Pi]$.

We have that in $\sum_y \mathbb{E}[M_y M_y^\top \mid \Pi]$ only the diagonal terms are non-zero due to independence, and

$\mathbb{E}[M_y M_y^\top \mid \Pi] \le \mathrm{Diag}((F \odot \tilde{F})\pi_y) \quad (19)$

entry-wise, assuming Bernoulli random variables. Thus,


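$\left\|\sum_{y \in Y} \mathbb{E}[M_y M_y^\top \mid \Pi]\right\| \le \max_{u \in U, t \in T} \sum_{y \in Y, j \in [k]} F(u,j)\tilde{F}(t,j)\pi_y(j)$

$= \max_{u \in U, t \in T} \sum_{y \in Y, j \in [k]} F(u,j)\tilde{F}(t,j)\Pi_Y(j,y)$

$\le \max_{i \in [k]} \sum_{y \in Y, j \in [k]} P(i,j)\tilde{P}(i,j)\Pi_Y(j,y)$

$= \|(P * \tilde{P})\Pi_Y\|_\infty, \quad (20)$

where $*$ indicates Hadamard or entry-wise product. Similarly $\sum_{y \in Y} \mathbb{E}[M_y^\top M_y] = \sum_{y \in Y} \mathrm{Diag}(\mathbb{E}[m_y^\top m_y]) \le
\|(P * \tilde{P})\Pi_Y\|_\infty$. From Lemma 11 we have a bound on $\|(P * \tilde{P})\Pi_Y\|_\infty$.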

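We now bound $\|M_y\| = \|m_y\|$ through vector Bernstein’s inequality. We have for Bernoulli $G$,

$\max_{u \in U, t \in T} |\hat{G}(\{u,t\},y) - \bar{G}(\{u,t\},y)| \le 2$

and

$\sum_{u \in U, t \in T} \mathbb{E}[\hat{G}(\{u,t\},y) - \bar{G}(\{u,t\},y)]^2 \le \sum_{u \in U, t \in T} ((F \odot \tilde{F})\pi_y)_{u,t} \le \|F \odot \tilde{F}\|_1.$

Thus with probability $1 - \delta$, we have

$\|M_y\| \le (1 + \sqrt{8\log(1/\delta)})\sqrt{\|F \odot \tilde{F}\|_1} + (8/3)\log(1/\delta).$

Thus, we have the bound that $\|\sum_y M_y\| = O\left(\max\left(\sqrt{\|F \odot \tilde{F}\|_1}, \sqrt{\|(P * \tilde{P})\Pi_Y\|_\infty}\right)\right)$.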


□ 
For a given $\delta \in (0,1)$, we assume that the sets $U$, $T$ and $Y \subset R$ are large enough to satisfy

x/TcGm> 

Lemma 11 (Concentration bounds) With probability 1 — 5, 

||i" © Fill < |(/| • |r| max(P • EH), max(P • EH)^ + ^^|(/| • |r| • • log 

|||(P©P),|| < max||nf/|| • linril • ||P|| • ||P|I 

i 

= O (^V\U\ ■ |r|||E[7r7r’^]|| • {p - q + Vkq)^ , 
for the homogeneous setting. Similarly for subset Y <Z R, we have 

llnynf II < |F| • ||E[7r7r^]|| + ^^|y| • ||E[7r7rT]p . log ^ 

C7fc(nynf) > |F| • aPElvTTT^]) - |y| • ||E[7r7rP f • log ^ 

||(P * P)^ny Iloo < |y| max(E[7r]^ • (P * P)), + ^^|L| • P",, • log ^ 


Remark: Note that $\sigma_{\min}(P) = \Theta(p - q)$ and $\|P\| = \Theta(p + q)$ for homogeneous $P$. Under Assumption A3,
the variance terms are small and the above quantities are close to their expectation.
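For intuition on the homogeneous case, a $k \times k$ matrix with $p$ on the diagonal and $q$ elsewhere is symmetric with eigenvalue $p + (k-1)q$ on the all-ones vector and eigenvalue $p - q$ on any vector whose entries sum to zero. The sketch below checks both on hypothetical values of $k$, $p$, $q$. Reading the $\Theta(p + q)$ statement as treating $k$ as a constant is an interpretation made here, not something the text states.

```python
def homogeneous(k, p, q):
    """k x k matrix with p on the diagonal and q elsewhere."""
    return [[p if i == j else q for j in range(k)] for i in range(k)]


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


k, p, q = 4, 0.8, 0.1
P = homogeneous(k, p, q)

ones = [1.0] * k
diff = [1.0, -1.0] + [0.0] * (k - 2)   # entries sum to zero

top = matvec(P, ones)   # equals (p + (k - 1) * q) * ones
low = matvec(P, diff)   # equals (p - q) * diff
```

For $p > q \ge 0$ the smallest singular value is therefore $p - q$ and the spectral norm is $p + (k-1)q$.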


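Proof: To bound $\|F \odot \tilde{F}\|_1$, we note that $\|\mathbb{E}[F \odot \tilde{F}]\|_1 \le |U| \cdot |T| \max_i (P^\top \cdot \mathbb{E}[\pi])_i (\tilde{P}^\top \cdot \mathbb{E}[\pi])_i$.

Using Bernstein’s inequality, for each column of $F \odot \tilde{F}$, we have, with probability $1 - \delta$,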


||(P © P),||i - |P| • ITKEH, (P),)(E[7r], (P),) 


< 


^|PHr|-p4,,-iogl^^^^ 


by applying Bernstein’s inequality, sinee (tt, (P)j)(7r, (P),) < maxj(P 7r)i(P vr), < Pmax) 


max ||E[(P)77r„7r;[(P),]-E[(P)77ri7r7(P),]||, ||E[7r;[(P),(P)77r.] • E[7r7(P),(P)7H || 

\u&U,t&T u&U,t€T 

<|PHr|-pl,. 


The other results follow similarly. 


□ 


The lowest singular value for the Khatri-Rao product is a bit more involved and we provide the bound
below.

Lemma 12 (Spectral Bound for KR-product) 

aliF © P) > |P| • \T\akir * T) - ^^|P| • |r| • ||P||2 • ||P||2 . ||E[7r7rT]p . log 
where $\Gamma := P^\top \mathbb{E}[\pi\pi^\top] P$ and $*$ denotes Hadamard product.





Proof: The result in the Lemma follows directly from the concentration result. For the homogeneous
setting, we have for a matrix $\Gamma$,

$\sigma_k(\Gamma * \Gamma) = \Theta\left(\min_i \Gamma(i,i)^2 - \max_{i \ne j} \Gamma(i,j)^2\right).$

Substituting we have the result. □



Remark: For the homogeneous setting, with P = P having p on the diagonal and q on the off-diagonal, 

we have 


r = 


where u is a vector where 
following bound 


{P - q)I + lE[7r7r'''] {p - q)I + qll' 

{p — q)‘^¥,['KTi^] + 2{p — q)qvl^ + g^||E[7r7r''" 

Vi = ||E[7r7r 111 , where denotes the P row of M. Thus, we have the 


,11 


T 


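$\sigma_k(\Gamma * \Gamma) = \Theta\left(\min_{i \ne j} (\Gamma(i,i)^2 - \Gamma(i,j)^2)\right)$

$= \Omega\left((p - q)^4 \min_{i \ne j} (\mathbb{E}[\pi_i^2] - \mathbb{E}[\pi_i\pi_j])^2\right),$

assuming that $\mathbb{E}[\pi_i^2] - \mathbb{E}[\pi_i\pi_j] = \Theta(\mathbb{E}[\pi_i^2])$ for all $i \ne j$, and the other terms which are dropped are positive.
Thus, we have w.h.p.

$\sigma_k(F \odot \tilde{F}) = \Omega\left(n(p - q)^2 \min_{i \ne j} (\mathbb{E}[\pi_i^2] - \mathbb{E}[\pi_i\pi_j])\right) \quad (21)$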


D Standard Matrix Concentration and Perturbation Bounds 


D.l Bernstein’s Inequalities 


One of the key tools we use is the standard matrix Bernstein inequality (Tropp, 2012, thm. 6.1, 6.2).


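Proposition 2 (Matrix Bernstein Inequality) Suppose $Z = \sum_j W_j$ where

1. $W_j$ are independent random matrices with dimension $d_1 \times d_2$,

2. $\mathbb{E}[W_j] = 0$ for all $j$,

3. $\|W_j\| \le R$ almost surely.

Let $d = d_1 + d_2$, and $\sigma^2 = \max\left\{\left\|\sum_j \mathbb{E}[W_j W_j^\top]\right\|, \left\|\sum_j \mathbb{E}[W_j^\top W_j]\right\|\right\}$, then we have


$\Pr[\|Z\| \ge t] \le d \cdot \exp\left\{\frac{-t^2/2}{\sigma^2 + Rt/3}\right\}$

$\le d \cdot \exp\left\{\frac{-3t^2}{8\sigma^2}\right\}, \quad t \le \sigma^2/R,$

$\le d \cdot \exp\left\{\frac{-3t}{8R}\right\}, \quad t \ge \sigma^2/R.$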
Proposition 3 (Vector Bernstein Inequality) Let $z = (z_1, z_2, \ldots, z_n) \in \mathbb{R}^n$ be a random vector with
independent entries, $\mathbb{E}[z_i] = 0$, $\mathbb{E}[z_i^2] = \sigma_i^2$, and $\Pr[|z_i| \le 1] = 1$. Let $A = [a_1 | a_2 | \cdots | a_n] \in \mathbb{R}^{m \times n}$ be a
matrix, then

$\Pr\left[\|Az\| \le (1 + \sqrt{8t})\sqrt{\sum_{i=1}^{n} \|a_i\|^2 \sigma_i^2} + (4/3)\max_{i \in [n]} \|a_i\|\, t\right] \ge 1 - e^{-t}.$


D.2 Hanson-Wright Inequalities 


We require the Hanson-Wright inequality (Rudelson and Vershynin, 2013).


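Proposition 4 (Hanson-Wright Inequality: sub-Gaussian bound) Let $z = (z_1, z_2, \ldots, z_n) \in \mathbb{R}^n$ be a
random vector with independent entries, $\mathbb{E}[z_i] = 0$ and $\Pr[|z_i| \le 1] = 1$ and let $M \in \mathbb{R}^{n \times n}$ be any matrix.
There exists a constant $c > 0$ s.t.

$\Pr\left[|z^\top M z - \mathbb{E}(z^\top M z)| > t\right] \le 2\exp\left\{-c \min\left(\frac{t^2}{\|M\|_F^2}, \frac{t}{\|M\|}\right)\right\}.$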


Unfortunately the sub-Gaussian bound is not strong enough when $z$ has small variance $\sigma^2$. In this case,
we get the perturbation as $O(\|M\|_F)$ instead of $O(\sigma^2\|M\|_F)$, which is desired. This is because for a bounded
random variable, the sub-Gaussian parameter only depends on the bound and not on the variance.


We will consider an extension of the Hanson-Wright inequality to sub-exponential random variables
(Erdős et al., 2012; Vu and Wang, 2013) and employ the sub-exponential formulation for bounded random
variables. We first define sub-exponential random variable (Vershynin, 2010, Definition 5.13).


Definition 1 (Sub-exponential Random Variable) A zero-mean random variable $X$ is said to be
sub-exponential if there exists a parameter $K$ such that $\mathbb{E}[e^{X/K}] \le e$.


Remark: There are other equivalent notions for sub-exponential random variables (Vershynin, 2010,
Definition 5.13), but this will be the convenient one for proving sub-exponential bound for Bernoulli random
variables. It is easy to see that the centered Bernoulli random variables are sub-exponential for some constant
$K$.

We will employ the following version of Hanson-Wright’s inequality for sub-exponential random variables
(Erdős et al., 2012, Lemma B.2).


Proposition 5 (Hanson-Wright Inequality: sub-exponential bound) Let $z = (z_1, z_2, \ldots, z_n) \in \mathbb{R}^n$ be
a random vector with independent entries, $\mathbb{E}[z_i] = 0$, $\mathbb{E}[z_i^2] \le \sigma^2$ and $z_i$ are sub-exponential and let
$M \in \mathbb{R}^{n \times n}$ be any matrix. There exist constants $c, C > 0$ s.t.


Pr 


z^Mz - E(z ' Mz)| > ta^\\M\\i 


< C exp 




Remark: The result in the form above appears in (Vu and Wang, 2013, (13)) and we set a = 1 in (Vu and Wang,
2013, (13)). The parameter $C$ above differs from the sub-exponential parameter $K$ by only a constant factor.

Comparing sub-exponential formulation in Proposition 5 with sub-Gaussian formulation in Proposition 4,
we see that in the former, the deviation is $O(\|M\|_F \sigma^2)$, while in the latter it is only $O(\|M\|_F)$.

Thus, for centered Bernoulli random variables, we can employ Proposition 5 and we will use it for
distance concentration bounds.
D.3 Davis-Kahan Inequality

We also use the standard Davis and Kahan bound for subspace perturbation.


Proposition 6 (Davis and Kahan) For a matrix $\hat{A}$, let $\widehat{\mathrm{Proj}}$ be the projection matrix on to its top-$k$ left
singular vectors. For any rank-$k$ matrix $A$, we have


$\|(\widehat{\mathrm{Proj}} - I)A\| \le 2\|\hat{A} - A\|$


Proof: This is directly from (McSherry, 2001, Lemma 12). By writing $A = \hat{A} - (\hat{A} - A)$, we have

$\|(\widehat{\mathrm{Proj}} - I)A\| \le \|(\widehat{\mathrm{Proj}} - I)\hat{A}\| + \|(\widehat{\mathrm{Proj}} - I)(\hat{A} - A)\|,$


and each of the terms is less than $\|\hat{A} - A\|$. For the first term, it is because $\widehat{\mathrm{Proj}}\,\hat{A}$ is the best rank-$k$
approximation of $\hat{A}$ and since $A$ is also rank $k$, the residual $\|(\widehat{\mathrm{Proj}} - I)\hat{A}\| \le \|\hat{A} - A\|$. For the second
term, $\|(\widehat{\mathrm{Proj}} - I)(\hat{A} - A)\| \le \|\hat{A} - A\|$ since $(\widehat{\mathrm{Proj}} - I)$ cannot increase norm. □


References 

Edoardo M. Airoldi, David M. Blei, Stephen E. Fienberg, and Eric P. Xing. Mixed membership stochastic
blockmodels. Journal of Machine Learning Research, 9:1981-2014, June 2008.

A. Anandkumar, R. Ge, D. Hsu, and S. M. Kakade. A Tensor Spectral Approach to Learning Mixed
Membership Community Models. In Conference on Learning Theory (COLT), June 2013.

Animashree Anandkumar, Rong Ge, Daniel Hsu, and Sham M Kakade. A tensor approach to learning mixed 
membership community models. The Journal of Machine Learning Research, 15(1):2239-2312, 2014a. 

Animashree Anandkumar, Rong Ge, and Majid Janzamin. Guaranteed non-orthogonal tensor decomposition 
via alternating rank-1 updates. arXiv preprint arXiv:1402.5180, 2014b. 

Michael Brinkmeier, Jeremias Werner, and Sven Recknagel. Communities in graphs and hypergraphs. In 
Proceedings of the sixteenth ACM conference on Conference on information and knowledge management, 
pages 869-872. ACM, 2007. 

Abhijnan Chakraborty and Saptarshi Ghosh. Clustering hypergraphs for discovery of overlapping
communities in folksonomies. In Dynamics On and Of Complex Networks, Volume 2, pages 201-220. Springer,
2013.

Abhijnan Chakraborty, Saptarshi Ghosh, and Niloy Ganguly. Detecting overlapping communities in
folksonomies. In Proceedings of the 23rd ACM conference on Hypertext and social media, pages 213-218.
ACM, 2012.

László Erdős, Horng-Tzer Yau, and Jun Yin. Bulk universality for generalized Wigner matrices. Probability
Theory and Related Fields, 154(1-2):341-407, 2012.

F. Huang, U.N. Niranjan, M. Hakeem, and A. Anandkumar. Fast Detection of Overlapping Communities
via Online Tensor Methods. ArXiv 1309.0787, Sept. 2013.







Stefanie Jegelka, Suvrit Sra, and Arindam Banerjee. Approximation algorithms for tensor clustering. In 
Algorithmic learning theory, pages 368-383. Springer, 2009. 

Ioannis Konstas, Vassilios Stathopoulos, and Joemon M Jose. On social networks and collaborative
recommendation. In Proceedings of the 32nd international ACM SIGIR conference on Research and
development in information retrieval, pages 195-202. ACM, 2009.

Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi Konuru, Hari Sundaram, and Aisling Kelliher. Metafac:
community discovery via relational hypergraph factorization. In Proceedings of the 15th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 527-536. ACM, 2009.

F. McSherry. Spectral partitioning of random graphs. In FOCS, 2001. 

Tsuyoshi Murata. Detecting communities from tripartite networks. In Proceedings of the 19th international 
conference on World wide web, pages 1159-1160. ACM, 2010. 

Nicolas Neubauer and Klaus Obermayer. Towards community detection in k-partite k-uniform hypergraphs. 
In Proceedings of the NIPS 2009 Workshop on Analyzing Networks and Learning with Graphs, pages 1-9, 
2009. 

Symeon Papadopoulos, Yiannis Kompatsiaris, and Athena Vakali. A graph-based clustering scheme for 
identifying related tags in folksonomies. In Data Warehousing and Knowledge Discovery, pages 65-76. 
Springer, 2010. 

Mark Rudelson and Roman Vershynin. Hanson-wright inequality and sub-gaussian concentration. arXiv 
preprint arXiv:1306.2872. 2013. 

J.A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathe¬ 
matics, 12(4):389-434, 2012. 

Alexei Vazquez. Finding hypergraph communities: a bayesian approach and variational solution. Journal 
of Statistical Mechanics: Theory and Experiment, 2009(07):P07006, 2009. 

Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint 
arXiv:1011.3027, 2010. 

Van Vu and Ke Wang. Random weighted projections, random quadratic forms and random eigenvectors. 
arXiv preprint arXiv:1306.3099, 2013. 

Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu. Discovering overlapping groups in social media. In Data 
Mining (ICDM), 2010 IEEE 10th International Conference on, pages 569-578. IEEE, 2010. 

Shengliang Xu, Shenghua Bao, Ben Fei, Zhong Su, and Yong Yu. Exploring folksonomy for personalized
search. In Proceedings of the 31st annual international ACM SIGIR conference on Research and
development in information retrieval, pages 155-162. ACM, 2008.

Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: A nonnegative matrix
factorization approach. In Proceedings of the sixth ACM international conference on Web search and data
mining, pages 587-596. ACM, 2013.

Chen Yudong, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information
Processing Systems 25, 2012.




